{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "**作者**：Gaël Varoquaux\n",
    "\n",
    "**Donald Knuth**\n",
    "\n",
    "\"过早的优化是一切罪恶的根源\"\n",
    "\n",
    "---\n",
    "\n",
    "本章处理用策略让Python代码跑得更快。\n",
    "\n",
    "---\n",
    "\n",
    "**先决条件**\n",
    "\n",
    "- line_profiler\n",
    "- gprof2dot\n",
    "- 来自dot实用程序\n",
    "\n",
    "---\n",
    "\n",
    "---\n",
    "\n",
    "**章节内容**\n",
    "\n",
    "- 优化工作流\n",
    "- 剖析Python代码\n",
    "    - Timeit\n",
    "    - Profiler\n",
    "    - Line-profiler\n",
    "    - Running `cProfile`\n",
    "    - Using `gprof2dot`\n",
    "- 让代码更快\n",
    "    - 算法优化\n",
    "        - SVD的例子\n",
    "- 写更快的数值代码\n",
    "    - 其他的链接\n",
    "\n",
    "## 2.4.1 优化工作流\n",
    "\n",
    "1. 让它工作起来：用简单清晰的方式来写代码。\n",
    "2. 让它可靠的工作：写自动的测试案例，以便真正确保你的算法是正确的，并且如果你破坏它，测试会捕捉到。\n",
    "3. 通过剖析简单的使用案例找到瓶颈，并且加速这些瓶颈，寻找更好的算法或实现方式来优化代码。记住在剖析现实例子时简单和代码的执行速度需要进行一个权衡。要有效的运行，最好让剖析工作持续10s左右。\n",
    "\n",
    "## 2.4.2剖析Python代码\n",
    "\n",
    "**无测量无优化！**\n",
    "\n",
    "- **测量**: 剖析, 计时\n",
    "- 你可能会惊讶：最快的代码并不是通常你想的样子\n",
    "\n",
    "### 2.4.2.1 Timeit\n",
    "在IPython中，使用timeit(http://docs.python.org/library/timeit.html)来计时基本的操作："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The slowest run took 60.37 times longer than the fastest. This could mean that an intermediate result is being cached \n",
      "100000 loops, best of 3: 1.99 µs per loop\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "a = np.arange(1000)\n",
    "\n",
    "%timeit a ** 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10000 loops, best of 3: 45.1 µs per loop\n"
     ]
    }
   ],
   "source": [
    "%timeit a ** 2.1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The slowest run took 12.79 times longer than the fastest. This could mean that an intermediate result is being cached \n",
      "100000 loops, best of 3: 1.86 µs per loop\n"
     ]
    }
   ],
   "source": [
    "%timeit a * a"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "用这个信息来指导在不同策略间进行选择。\n",
    "\n",
    "---\n",
    "\n",
    "**笔记**：对于运行时间较长的单元，使用`%time`来代替`%timeit`; 它准确性较差但是更快。\n",
    "\n",
    "---\n",
    "\n",
    "### 2.4.2.2 Profiler\n",
    "\n",
    "当你有个大型程序要剖析时比较有用，例如[下面这个程序](http://scipy-lectures.github.io/_downloads/demo.py)："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# For this example to run, you also need the 'ica.py' file\n",
    "\n",
    "import numpy as np\n",
    "from scipy import linalg\n",
    "\n",
    "from ica import fastica\n",
    "\n",
    "\n",
    "def test():\n",
    "    data = np.random.random((5000, 100))\n",
    "    u, s, v = linalg.svd(data)\n",
    "    pca = np.dot(u[:, :10].T, data)\n",
    "    results = fastica(pca.T, whiten=False)\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    test()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "**笔记**：这种技术是两个非监督学习技术的组合，主成分分析（PCA）和独立成分分析（[ICA](http://scipy-lectures.github.io/advanced/optimizing/index.html#id16)）。PCA是一种降维技术，即一种用更少的维度解释数据中观察到的变异的算法。ICA是一种源信号分离技术，例如分离由多个传感器记录的多种信号。如果传感器比信号多，那么先进行PCA然后ICA会有帮助。更多的信息请见：[来自scikits-learn的FastICA例子](http://scikit-learn.org/stable/auto_examples/decomposition/plot_ica_blind_source_separation.html)。\n",
    "\n",
    "---\n",
    "\n",
    "要运行它，你也需要下载[ica模块](http://scipy-lectures.github.io/_downloads/ica.py)。在IPython我们计时这个脚本："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "IPython CPU timings (estimated):\n",
      "  User   :       6.62 s.\n",
      "  System :       0.17 s.\n",
      "Wall time:       3.72 s.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/cloga/Documents/scipy-lecture-notes_cn/ica.py:65: RuntimeWarning: invalid value encountered in sqrt\n",
      "  W = (u * np.diag(1.0/np.sqrt(s)) * u.T) * W  # W = (W * W.T) ^{-1/2} * W\n",
      "/Users/cloga/Documents/scipy-lecture-notes_cn/ica.py:90: RuntimeWarning: invalid value encountered in absolute\n",
      "  lim = max(abs(abs(np.diag(np.dot(W1, W.T))) - 1))\n"
     ]
    }
   ],
   "source": [
    "%run -t demo.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "并且剖析它：\n",
    "\n",
    "```\n",
    "%run -p demo.py\n",
    "\n",
    "         301 function calls in 3.746 seconds\n",
    "\n",
    "   Ordered by: internal time\n",
    "\n",
    "   ncalls  tottime  percall  cumtime  percall filename:lineno(function)\n",
    "        1    3.714    3.714    3.715    3.715 decomp_svd.py:15(svd)\n",
    "        1    0.019    0.019    3.745    3.745 demo.py:3(<module>)\n",
    "        1    0.007    0.007    0.007    0.007 {method 'random_sample' of 'mtrand.RandomState' objects}\n",
    "       14    0.003    0.000    0.003    0.000 {numpy.core._dotblas.dot}\n",
    "        1    0.001    0.001    0.001    0.001 function_base.py:550(asarray_chkfinite)\n",
    "        2    0.000    0.000    0.000    0.000 linalg.py:1116(eigh)\n",
    "        1    0.000    0.000    3.745    3.745 {execfile}\n",
    "        2    0.000    0.000    0.001    0.000 ica.py:58(_sym_decorrelation)\n",
    "        2    0.000    0.000    0.000    0.000 {method 'reduce' of 'numpy.ufunc' objects}\n",
    "        1    0.000    0.000    0.000    0.000 ica.py:195(gprime)\n",
    "        1    0.000    0.000    0.001    0.001 ica.py:69(_ica_par)\n",
    "        1    0.000    0.000    3.726    3.726 demo.py:9(test)\n",
    "        1    0.000    0.000    0.001    0.001 ica.py:97(fastica)\n",
    "        1    0.000    0.000    0.000    0.000 ica.py:192(g)\n",
    "       23    0.000    0.000    0.000    0.000 defmatrix.py:290(__array_finalize__)\n",
    "        4    0.000    0.000    0.000    0.000 twodim_base.py:242(diag)\n",
    "        1    0.000    0.000    3.746    3.746 interactiveshell.py:2616(safe_execfile)\n",
    "       10    0.000    0.000    0.000    0.000 {numpy.core.multiarray.array}\n",
    "        1    0.000    0.000    3.745    3.745 py3compat.py:279(execfile)\n",
    "        1    0.000    0.000    0.000    0.000 {method 'normal' of 'mtrand.RandomState' objects}\n",
    "       50    0.000    0.000    0.000    0.000 {isinstance}\n",
    "       10    0.000    0.000    0.000    0.000 defmatrix.py:66(asmatrix)\n",
    "       10    0.000    0.000    0.000    0.000 defmatrix.py:244(__new__)\n",
    "        9    0.000    0.000    0.000    0.000 numeric.py:394(asarray)\n",
    "        1    0.000    0.000    0.000    0.000 _methods.py:53(_mean)\n",
    "        1    0.000    0.000    0.000    0.000 {posix.getcwdu}\n",
    "        4    0.000    0.000    0.000    0.000 {method 'astype' of 'numpy.ndarray' objects}\n",
    "        6    0.000    0.000    0.000    0.000 defmatrix.py:338(__mul__)\n",
    "        2    0.000    0.000    0.000    0.000 linalg.py:139(_commonType)\n",
    "        4    0.000    0.000    0.000    0.000 {method 'view' of 'numpy.ndarray' objects}\n",
    "        1    0.000    0.000    0.000    0.000 posixpath.py:329(normpath)\n",
    "        5    0.000    0.000    0.000    0.000 {abs}\n",
    "        1    0.000    0.000    0.000    0.000 {open}\n",
    "        1    0.000    0.000    0.000    0.000 blas.py:172(find_best_blas_type)\n",
    "        1    0.000    0.000    0.000    0.000 blas.py:216(_get_funcs)\n",
    "        1    0.000    0.000    0.000    0.000 syspathcontext.py:64(__exit__)\n",
    "        3    0.000    0.000    0.000    0.000 {max}\n",
    "        6    0.000    0.000    0.000    0.000 {method 'transpose' of 'numpy.ndarray' objects}\n",
    "        1    0.000    0.000    0.000    0.000 posixpath.py:120(dirname)\n",
    "        2    0.000    0.000    0.000    0.000 linalg.py:101(get_linalg_error_extobj)\n",
    "        2    0.000    0.000    0.000    0.000 linalg.py:106(_makearray)\n",
    "        3    0.000    0.000    0.000    0.000 {numpy.core.multiarray.zeros}\n",
    "        6    0.000    0.000    0.000    0.000 defmatrix.py:928(getT)\n",
    "        1    0.000    0.000    0.000    0.000 syspathcontext.py:57(__enter__)\n",
    "        2    0.000    0.000    0.000    0.000 linalg.py:209(_assertNdSquareness)\n",
    "        7    0.000    0.000    0.000    0.000 {issubclass}\n",
    "        4    0.000    0.000    0.000    0.000 {getattr}\n",
    "        1    0.000    0.000    0.000    0.000 posixpath.py:358(abspath)\n",
    "        5    0.000    0.000    0.000    0.000 {method 'startswith' of 'unicode' objects}\n",
    "        2    0.000    0.000    0.000    0.000 linalg.py:198(_assertRankAtLeast2)\n",
    "        2    0.000    0.000    0.000    0.000 {method 'encode' of 'unicode' objects}\n",
    "       10    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}\n",
    "        1    0.000    0.000    0.000    0.000 _methods.py:43(_count_reduce_items)\n",
    "        1    0.000    0.000    0.000    0.000 {method 'all' of 'numpy.ndarray' objects}\n",
    "        4    0.000    0.000    0.000    0.000 linalg.py:124(_realType)\n",
    "        1    0.000    0.000    0.000    0.000 syspathcontext.py:54(__init__)\n",
    "        1    0.000    0.000    0.000    0.000 posixpath.py:61(join)\n",
    "        1    0.000    0.000    3.746    3.746 <string>:1(<module>)\n",
    "        1    0.000    0.000    0.000    0.000 _methods.py:40(_all)\n",
    "        4    0.000    0.000    0.000    0.000 linalg.py:111(isComplexType)\n",
    "        2    0.000    0.000    0.000    0.000 {method '__array_prepare__' of 'numpy.ndarray' objects}\n",
    "        4    0.000    0.000    0.000    0.000 {min}\n",
    "        1    0.000    0.000    0.000    0.000 py3compat.py:19(encode)\n",
    "        1    0.000    0.000    0.000    0.000 defmatrix.py:872(getA)\n",
    "        2    0.000    0.000    0.000    0.000 numerictypes.py:949(_can_coerce_all)\n",
    "        6    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}\n",
    "        1    0.000    0.000    0.000    0.000 numerictypes.py:970(find_common_type)\n",
    "        1    0.000    0.000    0.000    0.000 {method 'mean' of 'numpy.ndarray' objects}\n",
    "       11    0.000    0.000    0.000    0.000 {len}\n",
    "        1    0.000    0.000    0.000    0.000 numeric.py:464(asanyarray)\n",
    "        1    0.000    0.000    0.000    0.000 {method '__array__' of 'numpy.ndarray' objects}\n",
    "        1    0.000    0.000    0.000    0.000 {method 'rfind' of 'unicode' objects}\n",
    "        2    0.000    0.000    0.000    0.000 {method 'upper' of 'str' objects}\n",
    "        1    0.000    0.000    0.000    0.000 posixpath.py:251(expanduser)\n",
    "        3    0.000    0.000    0.000    0.000 {method 'setdefault' of 'dict' objects}\n",
    "        1    0.000    0.000    0.000    0.000 {method 'diagonal' of 'numpy.ndarray' objects}\n",
    "        1    0.000    0.000    0.000    0.000 lapack.py:239(get_lapack_funcs)\n",
    "        1    0.000    0.000    0.000    0.000 {method 'rstrip' of 'unicode' objects}\n",
    "        1    0.000    0.000    0.000    0.000 py3compat.py:29(cast_bytes)\n",
    "        1    0.000    0.000    0.000    0.000 posixpath.py:52(isabs)\n",
    "        1    0.000    0.000    0.000    0.000 {method 'split' of 'unicode' objects}\n",
    "        1    0.000    0.000    0.000    0.000 {method 'endswith' of 'unicode' objects}\n",
    "        1    0.000    0.000    0.000    0.000 {sys.getdefaultencoding}\n",
    "        1    0.000    0.000    0.000    0.000 {method 'insert' of 'list' objects}\n",
    "        1    0.000    0.000    0.000    0.000 {method 'remove' of 'list' objects}\n",
    "        1    0.000    0.000    0.000    0.000 {method 'join' of 'unicode' objects}\n",
    "        1    0.000    0.000    0.000    0.000 {method 'index' of 'list' objects}\n",
    "        1    0.000    0.000    0.000    0.000 misc.py:126(_datacopied)\n",
    "        1    0.000    0.000    0.000    0.000 {sys.getfilesystemencoding}\n",
    "        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}\n",
    "```\n",
    "\n",
    "很明显`svd`（**decomp.py**中）占用了最多的时间，换句话说，是瓶颈。我们要找到方法让这个步骤跑的更快，或者避免这个步骤（算法优化）。在其他部分花费时间是没用的。\n",
    "\n",
    "### 2.4.2.3 Line-profiler\n",
    "\n",
    "profiler很棒：它告诉我们哪个函数花费了最多的时间，但是，不是它在哪里被调用。\n",
    "\n",
    "关于这一点，我们使用[line_profiler](http://packages.python.org/line_profiler/)：在源文件中，我们用@profile（不需要导入它）修饰了一些想要用检查的函数："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "@profile\n",
    "def test():\n",
    "    data = np.random.random((5000, 100))\n",
    "    u, s, v = linalg.svd(data)\n",
    "    pca = np.dot(u[: , :10], data)\n",
    "    results = fastica(pca.T, whiten=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接着我们用[kernprof.py](http://packages.python.org/line_profiler/kernprof.py)来运行这个脚本，开启`-l, --line-by-line`和`-v, --view`来使用逐行profiler，并且查看结果并保存他们：\n",
    "\n",
    "```\n",
    "kernprof.py -l -v demo.py\n",
    "\n",
    "Wrote profile results to demo.py.lprof\n",
    "Timer unit: 1e-06 s\n",
    "\n",
    "File: demo.py\n",
    "Function: test at line 5\n",
    "Total time: 14.2793 s\n",
    "\n",
    "Line #      Hits         Time  Per Hit   % Time  Line Contents\n",
    "==============================================================\n",
    "    5                                           @profile\n",
    "    6                                           def test():\n",
    "    7         1        19015  19015.0      0.1      data = np.random.random((5000, 100))\n",
    "    8         1     14242163 14242163.0   99.7      u, s, v = linalg.svd(data)\n",
    "    9         1        10282  10282.0      0.1      pca = np.dot(u[:10, :], data)\n",
    "   10         1         7799   7799.0      0.1      results = fastica(pca.T, whiten=False)\n",
    "\n",
    "```\n",
    "\n",
    "SVD占用了几乎所有时间，我们需要优化这一行。\n",
    "\n",
    "### 2.4.2.4 运行`cProfile`\n",
    "\n",
    "在上面的IPython例子中，Ipython只是调用了内置的[Python剖析器](http://docs.python.org/2/library/profile.html)`cProfile`和`profile`。如果你想要用一个可视化工具来处理剖析器的结果，这会有帮助。\n",
    "\n",
    "```\n",
    "python -m cProfile -o demo.prof demo.py\n",
    "```\n",
    "\n",
    "使用`-o`开关将输入剖析器结果到文件`demo.prof`。\n",
    "\n",
    "### 2.4.2.5 使用`gprof2dot`\n",
    "\n",
    "如果你想要更加视觉化的剖析器输入结果，你可以使用[gprof2dot](http://code.google.com/p/jrfonseca/wiki/Gprof2Dot)工具："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "gprof2dot -f pstats demo.prof | dot -Tpng -o demo-prof.png"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这会生成下面的图片：\n",
    "\n",
    "![](http://scipy-lectures.github.io/_images/demo-prof.png)\n",
    "\n",
    "这种方法打印了一个类似前一种方法的图片。\n",
    "\n",
    "## 2.4.3 让代码更快\n",
    "\n",
    "一旦我们识别出瓶颈，我们需要让相关的代码跑得更快。\n",
    "\n",
    "### 2.4.3.1 算法优化\n",
    "\n",
    "第一件要看的事情是算法优化：有没有计算量更小的方法或者更好的方法？\n",
    "\n",
    "从更高的视角来看这个问题，对算法背后的数学有一个很好的理解会有帮助。但是，寻找到像**将计算或内存分配移到循环外**这样的简单改变，来带来巨大的收益，通常很困难。\n",
    "\n",
    "#### 2.4.3.1.1 SVD的例子\n",
    "\n",
    "在上面的两个例子中，SVD - [奇异值分解](http://en.wikipedia.org/wiki/Singular_value_decomposition) - 花费了最多的时间。确实，这个算法的计算成本大概是输入矩阵大小的$n^3$。\n",
    "\n",
    "但是，在这些例子中，我们并不是使用SVD的所有输出，而只是它第一个返回参数的前几行。如果我们使用scipy的`svd`实现，我们可以请求一个这个SVD的不完整版本。注意scipy中的线性代数实现比在numpy中更丰富，应该被优选选用。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1 loops, best of 3: 4.12 s per loop\n"
     ]
    }
   ],
   "source": [
    "%timeit np.linalg.svd(data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1 loops, best of 3: 3.65 s per loop\n"
     ]
    }
   ],
   "source": [
    "from scipy import linalg\n",
    "%timeit linalg.svd(data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10 loops, best of 3: 70.5 ms per loop\n"
     ]
    }
   ],
   "source": [
    "%timeit linalg.svd(data, full_matrices=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10 loops, best of 3: 70.3 ms per loop\n"
     ]
    }
   ],
   "source": [
    "%timeit np.linalg.svd(data, full_matrices=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接下来我们可以用这个发现来[优化前面的代码](http://scipy-lectures.github.io/_downloads/demo_opt.py)："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import demo"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1 loops, best of 3: 3.65 s per loop\n"
     ]
    }
   ],
   "source": [
    "%timeit demo.test()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import demo_opt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10 loops, best of 3: 81.9 ms per loop\n"
     ]
    }
   ],
   "source": [
    "%timeit demo_opt.test()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "真实的非完整版SVD，即只计算前十个特征向量，可以用arpack来计算，可以在`scipy.sparse.linalg.eigsh`找到。\n",
    "\n",
    "---\n",
    "**计算线性代数**\n",
    "\n",
    "对于特定的算法，许多瓶颈会是线性代数计算。在这种情况下，使用正确的方法来解决正确的问题是关键。例如一个对称矩阵中的特征向量问题比通用矩阵中更好解决。同样，更普遍的是，你可以避免矩阵逆转，使用一些成本更低（在数字上更可靠）的操作。\n",
    "\n",
    "了解你的计算线性代数。当有疑问时，查找`scipy.linalg`，并且用`%timeit`来试一下替代方案。\n",
    "\n",
    "## 2.4.4 写更快的数值代码\n",
    "\n",
    "关于numpy的高级使用的讨论可以在[高级numpy](http://scipy-lectures.github.io/advanced/advanced_numpy/index.html#advanced-numpy)那章，或者由van der Walt等所写的文章[NumPy数组: 一种高效数值计算结构](http://hal.inria.fr/inria-00564007/en)。这里只是一些经常会遇到的让代码更快的小技巧。\n",
    "\n",
    "- 循环向量化\n",
    "\n",
    "    找到一些技巧来用numpy数组避免循环。对于这一点，掩蔽和索引通常很有用。\n",
    "\n",
    "- 广播\n",
    "\n",
    "    在数组合并前，在尽可能小的数组上使用广播。\n",
    "\n",
    "- 原地操作"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10 loops, best of 3: 33.5 ms per loop\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ipykernel/__main__.py:1: DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future\n",
      "  if __name__ == '__main__':\n"
     ]
    }
   ],
   "source": [
    "a = np.zeros(1e7)\n",
    "\n",
    "%timeit global a ; a = 0*a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "100 loops, best of 3: 8.98 ms per loop\n"
     ]
    }
   ],
   "source": [
    "%timeit global a ; a *= 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**注意**: 我们需要在timeit中`global a`，以便正常工作，因为，向a赋值，会被认为是一个本地变量。\n",
    "\n",
    "- 对内存好一点：使用视图而不是副本\n",
    "\n",
    "复制一个大数组和在上面进行简单的数值运算一样代价昂贵："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ipykernel/__main__.py:1: DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future\n",
      "  if __name__ == '__main__':\n"
     ]
    }
   ],
   "source": [
    "a = np.zeros(1e7)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10 loops, best of 3: 28.2 ms per loop\n"
     ]
    }
   ],
   "source": [
    "%timeit a.copy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10 loops, best of 3: 33.4 ms per loop\n"
     ]
    }
   ],
   "source": [
    "%timeit a + 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 注意缓存作用\n",
    "\n",
    "    分组后内存访问代价很低：用连续的方式访问一个大数组比随机访问快很多。这意味着在其他方式中小步幅会更快（见[CPU缓存作用](http://scipy-lectures.github.io/advanced/advanced_numpy/index.html#cache-effects)）:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ipykernel/__main__.py:1: DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future\n",
      "  if __name__ == '__main__':\n"
     ]
    }
   ],
   "source": [
    "c = np.zeros((1e4, 1e4), order='C')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The slowest run took 5.66 times longer than the fastest. This could mean that an intermediate result is being cached \n",
      "1 loops, best of 3: 80.9 ms per loop\n"
     ]
    }
   ],
   "source": [
    "%timeit c.sum(axis=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10 loops, best of 3: 79.7 ms per loop\n"
     ]
    }
   ],
   "source": [
    "%timeit c.sum(axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(80000, 8)"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c.strides"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这就是为什么Fortran顺序或者C顺序会在操作上有很大的不同："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "a = np.random.rand(20, 2**18)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "b = np.random.rand(20, 2**18)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10 loops, best of 3: 23.8 ms per loop\n"
     ]
    }
   ],
   "source": [
    "%timeit np.dot(b, a.T)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "c = np.ascontiguousarray(a.T)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10 loops, best of 3: 22.2 ms per loop\n"
     ]
    }
   ],
   "source": [
    "%timeit np.dot(b, c)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "注意，通过复制数据来绕过这个效果是不值得的："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10 loops, best of 3: 42.2 ms per loop\n"
     ]
    }
   ],
   "source": [
    "%timeit c = np.ascontiguousarray(a.T)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用[numexpr](http://code.google.com/p/numexpr/)可以帮助自动优化代码的这种效果。\n",
    "\n",
    "- 使用编译的代码\n",
    "\n",
    "    一旦你确定所有的高级优化都试过了，那么最后一招就是转移热点，即将最花费时间的几行或函数编译代码。要编译代码，优先选项是用使用[Cython](http://www.cython.org/)：它可以简单的将Python代码转化为编译代码，并且很好的使用numpy支持来以numpy数据产出高效代码，例如通过展开循环。\n",
    "\n",
    "---\n",
    "**警告**：对于以上的技巧，剖析并计时你的选择。不要基于理论思考来优化。\n",
    "\n",
    "---\n",
    "\n",
    "### 2.4.4.1 其他的链接\n",
    "\n",
    "- 如果你需要剖析内存使用，你要应该试试[memory_profiler](http://pypi.python.org/pypi/memory_profiler)\n",
    "- 如果你需要剖析C扩展程序，你应该用[yep](http://pypi.python.org/pypi/yep)从Python中试着使用一下[gperftools](http://code.google.com/p/gperftools/?redir=1)。\n",
    "- 如果你想要持续跟踪代码的效率，比如随着你不断向代码库提交，你应该试一下：[vbench](https://github.com/pydata/vbench)\n",
    "- 如果你需要一些交互的可视化为什么不试一下[RunSnakeRun](http://www.vrplumber.com/programming/runsnakerun/)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
